DCEP - Digital Corpus of the European Parliament

Authors

  • Najeh Hajlaoui
  • David Kolovratník
  • Jaakko Väyrynen
  • Ralf Steinberger
  • Dániel Varga

Abstract

We present a new, highly multilingual, document-aligned parallel corpus called DCEP (Digital Corpus of the European Parliament). It consists of various document types covering a wide range of subject domains. With a total of 1.37 billion words in 23 languages (253 language pairs), gathered in the course of ten years, this is the largest single release of documents by a European Union institution. DCEP contains most of the content of the European Parliament's official website. It includes different document types produced between 2001 and 2012, excluding only the documents that already exist in the Europarl corpus, to avoid overlap. We describe the typical acquisition steps of the DCEP corpus: data access, document alignment, sentence splitting, normalisation and tokenisation, and sentence alignment. The sentence-level alignment is still in progress, but first experiments show that DCEP is very useful for NLP applications, in particular for Statistical Machine Translation.

Similar resources

Tagging a Corpus of Interpreted Speeches: the European Parliament Interpreting Corpus (EPIC)

The performance of three different taggers (Treetagger, Freeling and GRAMPAL) is evaluated on three different languages, i.e. English, Italian and Spanish. The materials are transcripts from the European Parliament Interpreting Corpus (EPIC), a corpus of original (source) and simultaneously interpreted (target) speeches. Owing to the oral nature of our materials and to the specific characterist...

Multilingual Corpora for Cooperation

MLCC was a corpus acquisition project funded by the EC Telematics program. The aim was to collect a set of texts representing a substantial improvement in the range, quantity and quality of corpus material available. Two sub-corpora have been defined to help meet the needs for multilingual data, consisting of a comparable set of texts in six languages and a parallel set of data in nine languages. The c...

The Role of Political Parties in Empowering Women’s Positions in the UK Parliament

Women’s political representation has for decades been ahead of the women’s rights movement. However, women’s presence in politics is not limited to women’s political participation but extends to their position in party politics as well. This paper aims to analyze the role of parties in empowering women in the UK Parliament and tries to contribute to the existing literature by presenting an interdis...

Tweeting Europe: A text-analytic approach to unveiling the content of political actors’ Twitter activities in the European Parliament

Twitter is an important platform for communication and is frequently used by Members of the European Parliament (MEPs) to campaign and engage in discussion with constituents and colleagues in the parliament. Examining the issues that MEPs talk about on Twitter can thus inform us about their political priorities. Topic modelling aims to summarise a corpus of documents by capturing the underlying...

ECPC: el discurso parlamentario europeo desde la perspectiva de los estudios traductológicos de corpus

This paper presents the main outcome of the ECPC research group: an archive of European parliamentary speeches created to study this genre and the hypothetical influence of translation in the construction of European identity. The archive is made up of, on the one hand, a parallel corpus containing the English and Spanish versions of the European Parliament proceedings, and on the other hand, t...

Publication date: 2014